New York House Prices

Group L12 G03

Cifti Saggu, Daniella Jaqin, Hamza Ahmed, Zichen Liu and Mike Xu

Data Description

Introduction

  • Random sample of data from a county in New York

  • It’s sourced from the Data And Story Library (DASL).

  • Clean: there are no missing values.

  • Dependent variable → price, the sale price of each house

  • Independent variables → age, land value, living area, bathrooms, and rooms

1. Variables

  • Age - The age of the house in years, measured from the year of construction to the present (2023 in this case).
  • Land Value (USD) - The assessed value of the land.
  • Living Area (sq ft) - The size of the interior living space of the house.
  • Bathrooms - The number of bathrooms, counting both full bathrooms and half-baths.
  • Number of Rooms - The total number of rooms in the house.

2. Categories

  • Numerical Continuous: Age, Land Value, Living Area
  • Numerical Discrete: Bathrooms, Number of Rooms

Appropriate model selection - Mike

  • Goal: Predict house prices based on several property characteristics.

  • How did we do this? We fitted multiple regression models and selected the variables that gave the best predictive model.

  • Which models did we compare? Forward stepwise selection, backward stepwise selection, and exhaustive search.

These methods helped identify the most relevant property features that contribute to accurate price predictions.
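Forward stepwise selection, one of the methods compared, can be sketched as follows. This is a minimal illustration and not the group's actual code: it assumes AIC is the selection criterion (the slides do not say which criterion was used), and the data, the `noise` predictor, and the function names `ols_rss`, `aic` and `forward_select` are hypothetical.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def aic(rss, n, n_coef):
    """Gaussian AIC up to an additive constant (assumed criterion)."""
    return n * math.log(rss / n) + 2 * n_coef

def forward_select(predictors, y):
    """Start from the intercept-only model and add the predictor that lowers
    AIC the most, stopping when no addition lowers it further."""
    n = len(y)
    chosen = []
    best = aic(ols_rss([], y), n, 1)
    while len(chosen) < len(predictors):
        scores = {}
        for name in predictors:
            if name not in chosen:
                cols = [predictors[c] for c in chosen + [name]]
                scores[name] = aic(ols_rss(cols, y), n, len(cols) + 1)
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        chosen.append(name)
        best = scores[name]
    return chosen

# Made-up toy data: the response depends on living_area only.
predictors = {
    "living_area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
errors = [0.1, 0.1, -0.1, -0.1, -0.1, -0.1, 0.1, 0.1]
response = [3 + 2 * x + e for x, e in zip(predictors["living_area"], errors)]
selected = forward_select(predictors, response)
print(selected)
```

Backward selection runs the same loop in reverse, starting from the full model and dropping one predictor at a time, while exhaustive search scores every subset of predictors.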

Multiple regression model:

  • Reliable predictions

  • Clearer insights into house features and prices

  • Simple and interpretable

Assumptions

1. Linearity

Based on the correlation plots, we chose the following variables: age, land value, living area, bathrooms, and rooms.
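As a rough numeric companion to the correlation plots, the strength of a linear relationship between one predictor and the response can be summarised with the Pearson correlation coefficient. The sketch below uses made-up numbers, not values from the DASL sample, and the names `pearson_r`, `living_area` and `log_price` are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only.
living_area = [900, 1200, 1500, 1800, 2400]
log_price = [11.2, 11.5, 11.6, 11.9, 12.3]

r = pearson_r(living_area, log_price)
print(f"r = {r:.3f}")  # values near +1 or -1 suggest a strong linear relationship
```

A high correlation alone does not prove linearity, so it is best read alongside the scatter plots.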

2. Independence

Figure 1: Residual Plots for Predictors vs Model

3. Homoskedasticity

Figure 2: Residual Plot of Chosen Variables

The Breusch-Pagan test checks for heteroskedasticity. Its p-value (0.196) is greater than 0.05, so we fail to reject the null hypothesis of constant variance: there is no significant evidence of heteroskedasticity, and the homoskedasticity assumption is reasonable for this model.
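The mechanics of the test can be sketched in a simplified, single-predictor form. This assumes the studentized (Koenker) version of the statistic, LM = n·R² from regressing the squared residuals on the predictor, which is compared with a chi-square distribution with one degree of freedom. The real model has five predictors (five degrees of freedom), so this is an illustration of the idea and not a reproduction of the 0.196 result. The data and function names are hypothetical.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) of a one-predictor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return mean_y - slope * mean_x, slope, r_squared

def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan test for a single predictor.

    Returns (LM statistic, p-value). H0: the error variance is constant.
    """
    intercept, slope, _ = simple_ols(x, y)
    resid_sq = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor.
    _, _, r2_aux = simple_ols(x, resid_sq)
    lm = len(x) * r2_aux
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Made-up data: same trend, two different error patterns.
x = [float(i) for i in range(1, 21)]
signs = [1 if i % 2 else -1 for i in range(1, 21)]
y_constant = [2 * xi + s for xi, s in zip(x, signs)]           # constant spread
y_fanning = [2 * xi + 0.5 * xi * s for xi, s in zip(x, signs)]  # spread grows with x

lm_const, p_const = breusch_pagan_1d(x, y_constant)
lm_fan, p_fan = breusch_pagan_1d(x, y_fanning)
print(f"constant spread: p = {p_const:.3f}")
print(f"fanning spread:  p = {p_fan:.5f}")
```

A large p-value, as in the constant-spread case, means the test finds no evidence against constant variance; a small one, as in the fanning case, signals heteroskedasticity.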

4. Normality

Figure 3: Q-Q Plot of Linear Numeric Variables

Comparing Models

Model Comparison Based on Key Criteria

| Model | Accuracy | Scalability | Interpretability | Simplicity |
|---|---|---|---|---|
| Forward Selection | ✔ MAE: 0.21454, RMSE: 0.33068, R²: 0.5586 | ✔ Adapts well to new data | ✔ Straightforward | ✔ Highlights key drivers; AIC: 792.91, BIC: 831.12 |
| Backward Selection | ✔ MAE: 0.21454, RMSE: 0.33068, R²: 0.5586 | ✘ Overfit risk on larger data | ✘ Complex interactions | ✘ Complex to implement; AIC: 792.91, BIC: 831.12 |
| Exhaustive Search | ✔ MAE: 0.21702, RMSE: 0.33412, R²: 0.5467 | ✘ Computationally expensive | ✘ Difficult to explain | ✘ Complex and overwhelming; AIC: 835.16, BIC: 862.46 |
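For reference, the accuracy and information criteria in this comparison can be computed from a model's observed and fitted values roughly as follows. This is a sketch under assumptions: AIC and BIC use the full Gaussian log-likelihood and count the error variance as a parameter, which is one common convention; the slides do not state which convention produced the reported values. The function name `fit_metrics` and the toy numbers are hypothetical.

```python
import math

def fit_metrics(y_true, y_pred, n_coef):
    """MAE, RMSE, R-squared, AIC and BIC for a fitted linear model.

    n_coef is the number of regression coefficients, including the intercept.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rss = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    tss = sum((t - mean_y) ** 2 for t in y_true)
    # Gaussian log-likelihood evaluated at the maximum-likelihood variance rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coef + 1  # coefficients plus the error variance (assumed convention)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(rss / n),
        "R2": 1 - rss / tss,
        "AIC": 2 * k - 2 * log_lik,
        "BIC": math.log(n) * k - 2 * log_lik,
    }

# Toy values for illustration only.
metrics = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0], n_coef=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower MAE, RMSE, AIC and BIC indicate a better model, while higher R² does. BIC penalises extra coefficients more heavily than AIC once the sample has more than about seven observations.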

Model Outputs

Forward Model

Backward Model

Why the Forward Model Is Best

  • Ideal for stakeholders: highlights the key variables affecting price and adds a new variable only when it improves accuracy
  • Avoids unnecessary, complex interactions: well suited to non-technical audiences
  • Mirrors real-world property assessment, starting from the basics (size) and moving to more specific features (rooms, bathrooms, age).

Limitations

  • Nature of the data set:
    • Most of the initial data set was categorical, which limited the number of continuous predictors available to the model.

    • Because the data are specific to New York, the results may not apply to other states or countries.

    • Variables with a non-linear relationship to price could not be included, which limits the house features that can be compared.

Future Improvements

  • Improve the data set: include location and economic indicators (inflation, interest rates), and add more continuous predictors for finer granularity.

  • Consider non-linear models (e.g., decision trees, random forests) to capture complex relationships while keeping the results as interpretable as possible.

Conclusion

  • The data used in this analysis are a random sample of houses from a county in New York.
  • Our goal: to find the extent to which the numeric features of a house affect its price in this sample.

Key Findings

  • We used the multiple regression workflow to derive an equation that shows how strongly each numeric feature affects the price of a house.

log(Price) = 0.0002983×(living area) + 0.000003596×(land value) + 0.1107×(bathrooms) − 0.001356×(age) + 0.009131×(rooms)

| Feature | Direction | Approx. % effect on price per one-unit increase |
|---|---|---|
| Bathrooms (number of bathrooms) | Increase | 11.07% |
| Rooms (number of rooms) | Increase | 0.91% |
| Living Area (square feet) | Increase | 0.03% |
| Age (years) | Decrease | 0.14% |
| Land Value (US dollars) | Increase | 0.0004% |
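The percentages in this table correspond to 100 × coefficient, which is the usual small-change approximation for a model of log(Price). Assuming log is the natural logarithm (the slides do not say), the exact percentage change for a one-unit increase is 100 × (exp(coefficient) − 1). The sketch below compares the two using the coefficients from the equation; the helper name `percent_effect` is illustrative.

```python
import math

# Coefficients copied from the fitted equation for log(Price).
coefficients = {
    "living area (per sq ft)": 0.0002983,
    "land value (per USD)": 0.000003596,
    "bathrooms (per bathroom)": 0.1107,
    "age (per year)": -0.001356,
    "rooms (per room)": 0.009131,
}

def percent_effect(beta, delta=1.0):
    """Exact % change in price when a predictor rises by `delta` units,
    assuming the model is linear in the natural log of price."""
    return 100 * (math.exp(beta * delta) - 1)

for name, beta in coefficients.items():
    approx = 100 * beta          # small-change approximation used in the table
    exact = percent_effect(beta)
    print(f"{name:26s} approx {approx:9.5f}%   exact {exact:9.5f}%")

# Effects scale with the size of the change, e.g. an extra 100 sq ft of living area.
extra_100_sqft = percent_effect(coefficients["living area (per sq ft)"], delta=100)
print(f"extra 100 sq ft: {extra_100_sqft:.2f}%")
```

The approximation is close for small coefficients and drifts for larger ones such as bathrooms. Because each feature is measured in different units, per-unit effects are not directly comparable: one extra square foot is a far smaller change than one extra bathroom.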

Why People Should Care

  • Understanding which features drive house prices matters to all of us as potential homeowners, and also to investors, real estate agents, policymakers, and developers who rely on this knowledge to make decisions.

  • Although this data is from New York, it highlights the key features of a house that can lead to significant price differences.

Final Takeaway

  • Per one-unit increase, bathrooms had the biggest impact on price, followed by the number of rooms.

  • A house's age, land value, and living area each had a small per-unit impact, with age being the only feature that led to a decrease in price.

  • Thank you for your attention. We hope this helps with your home-buying decisions!